
feat(quantization): Add GPTQ n-bit quantization support #21551


Status: Open. Wants to merge 21 commits into base: master.

Conversation

@amitsrivastava78 (Collaborator) commented Aug 6, 2025

This commit integrates the GPTQ (Generative Pre-trained Transformer Quantization) algorithm into Keras.

Key features include:

  • A new GPTQConfig for configuring quantization parameters.
  • Integration with base Keras models via a model.quantize() method (see the usage sketch below).
  • Support for custom datasets; tested with models including GPT-2, OPT, Bloom, and Gemma 3.
  • Unit tests to verify perplexity and model functionality post-quantization.
  • A Colab demonstrating the feature is attached here.
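
A minimal usage sketch of the flow described above. The parameter names (dataset, tokenizer, wbits, group_size, symmetric, act_order) follow the review discussion below; the model and tokenizer loading are placeholders, not part of this PR:

    import keras
    from keras.quantizers import GPTQConfig  # export path per the review discussion

    # Placeholders: an LLM and tokenizer loaded elsewhere (e.g. via keras_hub).
    model = ...
    tokenizer = ...

    # A small calibration set; any list of strings (or pre-tokenized samples)
    # should work according to the PR description.
    calibration_data = ["The quick brown fox jumps over the lazy dog."] * 32

    config = GPTQConfig(
        dataset=calibration_data,  # calibration samples used to collect activations
        tokenizer=tokenizer,       # used to process `dataset` if it contains strings
        wbits=4,                   # quantize weights to 4 bits
        group_size=128,            # per-group quantization granularity
        symmetric=False,           # asymmetric quantization
        act_order=False,           # optionally reorder columns by activation importance
    )

    # Quantizes the supported layers in place using the GPTQ algorithm.
    model.quantize("gptq", config=config)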

@github-actions bot added the Gemma (Gemma model specific issues) label on Aug 6, 2025
@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @amitsrivastava78, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've implemented support for GPTQ (Generative Pre-trained Transformer Quantization) n-bit quantization within Keras. This feature allows for significantly reducing the memory footprint and improving the inference speed of large language models by quantizing their weights to lower precision, such as 4-bit. The integration provides a streamlined way to apply GPTQ to Keras models, enabling more efficient deployment on resource-constrained hardware while aiming to maintain model performance.

Highlights

  • New GPTQ Quantization Mode: I've introduced a new 'gptq' quantization mode, expanding the existing QUANTIZATION_MODES to support this advanced n-bit quantization technique.
  • Extended model.quantize() Method: The model.quantize() method has been updated to recognize and process the 'gptq' mode. This now requires a dedicated GPTQConfig object, which encapsulates all necessary parameters for the GPTQ algorithm, ensuring proper configuration and execution.
  • New Quantization Modules: I've added a new quantizers directory containing the core logic for GPTQ. This includes the GPTQ class for layer-specific operations, GPTQConfig for overall parameter management, gptqutils for data loading and layer-wise application, and quant for fundamental quantization functions.
  • Enhanced Testing for GPTQ: Comprehensive unit tests have been added and updated in model_test.py to validate the GPTQ implementation. These tests cover various scenarios, including different dataset types (in-memory, generator, and public datasets like WikiText2) and ensure the quantized models retain functionality.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces GPTQ n-bit quantization support to Keras, a significant feature. The implementation is well-structured, separating configuration, core logic, and utilities. The changes include a new GPTQConfig, integration into model.quantize(), the GPTQ algorithm implementation, and corresponding tests. My review has identified a few areas for improvement: removing leftover debug code, fixing an inconsistent error message, using HTTPS for downloads, addressing some dead code in tests, and refactoring for maintainability by reducing code duplication. Additionally, there's a performance consideration regarding tensor concatenation within a loop that could be optimized.
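
On the last point, the usual remedy is to collect tensors in a Python list and concatenate once after the loop rather than concatenating on every iteration; a generic sketch (not the PR's actual code):

    from keras import ops

    def gather_batches(batches):
        # Anti-pattern: `collected = ops.concatenate([collected, batch], axis=0)`
        # inside the loop re-copies the accumulated tensor on every iteration.
        # Preferred: append to a list and concatenate once at the end.
        collected = []
        for batch in batches:
            collected.append(batch)
        return ops.concatenate(collected, axis=0)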

@codecov-commenter commented Aug 6, 2025

Codecov Report

❌ Patch coverage is 89.05325% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.79%. Comparing base (8c55abe) to head (37370e0).
⚠️ Report is 2 commits behind head on master.

Files with missing lines                            Patch %   Lines
keras/src/quantizers/gptq_quant.py                   72.88%   8 missing, 8 partials ⚠️
keras/src/quantizers/gptq.py                         89.71%   6 missing, 5 partials ⚠️
keras/src/quantizers/gptq_core.py                    93.52%   4 missing, 5 partials ⚠️
keras/api/_tf_keras/keras/quantizers/__init__.py      0.00%   1 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21551      +/-   ##
==========================================
+ Coverage   82.75%   82.79%   +0.04%     
==========================================
  Files         567      571       +4     
  Lines       56471    56807     +336     
  Branches     8818     8883      +65     
==========================================
+ Hits        46730    47033     +303     
- Misses       7580     7597      +17     
- Partials     2161     2177      +16     
Flag Coverage Δ
keras 82.60% <89.05%> (+0.04%) ⬆️
keras-jax 63.93% <89.05%> (+0.14%) ⬆️
keras-numpy 58.03% <15.68%> (-0.26%) ⬇️
keras-openvino 34.54% <12.72%> (-0.14%) ⬇️
keras-tensorflow 64.37% <89.05%> (+0.15%) ⬆️
keras-torch 63.97% <89.05%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@JyotinderSingh (Collaborator) left a comment

Thanks for this PR! I have left a few initial comments.

@JyotinderSingh (Collaborator) commented

It would be helpful to attach colabs to the PR description showing improvements over raw 4-bit quantization for models that this feature has been tested with.

@hertschuh (Collaborator) left a comment

Thanks for the PR! There is a lot going on.

This is just a first pass, mostly high level / API comments.

@amitsrivastava78 (Collaborator, Author) commented

It would be helpful to attach colabs to the PR description showing improvements over raw 4-bit quantization for models that this feature has been tested with.

The Colab is now attached to the PR.

@divyashreepathihalli (Collaborator) left a comment

Thank you for this PR! I left a few comments.
The code here is lacking test coverage; the Codecov report suggests the same, with 62.40409% of the code missing test coverage.

@hertschuh (Collaborator) left a comment

One of the expectations is that after quantization, it should be possible to

  • run the model (using the quantized kernels)
  • save it (keeping it quantized)
  • reload it (keeping it quantized)
  • run it again (quantized).

And also

  • export it (which should trace it with the quantized kernels)

I don't see the hooks needed in Dense.quantized_call and EinsumDense.quantized_call and the variables to support that.


About the overall design:

  • GPTQConfig is the global config.
  • GPTQQuant is most importantly the "state" (quantized kernel), although it also has config and some logic to determine this state. Note that the logic to dequantize is separate.
  • GPTQ is the wrapper that connects together one layer with one GPTQQuant, so there are many GPTQ instances, and GPTQ is not where the core loop lives.

Let's find some names (or maybe even a different structure) that make it easier to follow this. I also find the split across 3 files (`gptq.py`, `gptqquant.py`, `gptqutils.py`) a little hard to navigate.
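
As a reading aid, the relationship described above in schematic form (class names from the PR; the bodies and signatures are invented for illustration):

    class GPTQConfig:
        """Global, user-facing configuration (bits, group size, dataset, ...)."""

    class GPTQQuant:
        """Per-layer quantization state (scales, zero points, quantized kernel),
        plus the logic that computes that state. Dequantization lives elsewhere."""

    class GPTQ:
        """Wrapper tying one layer to one GPTQQuant instance; the core loop over
        layers and calibration batches lives outside this class."""

        def __init__(self, layer, quantizer):
            self.layer = layer
            self.quantizer = quantizer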

self.group_size = group_size
self.symmetric = symmetric
self.act_order = act_order
self.quantization_method = "gptq"
Collaborator:

What is this for?

Collaborator (Author) replied:

This holds the group size, whether symmetric quantization is used, and whether act_order is needed; it is used while performing the GPTQ quantization.

@hertschuh (Collaborator) commented Aug 13, 2025

Oh sorry, GitHub makes this confusing because it always shows 4 lines of context.

I meant, what is self.quantization_method = "gptq" for?

It's never accessed and is implied by the fact that this is GPTQConfig.

return ops.multiply(scale, dequantized_x)


class GPTQQuant:
Collaborator:

What does Quant stand for? Quantizer? Quantized(Kernel)? Quantization?

Collaborator (Author) replied:

"Quant" stands for Quantization

Collaborator:

Can you call it GPTQQuantization? Let's limit abbreviations when there is ambiguity.

@amitsrivastava78 dismissed JyotinderSingh's stale review on August 12, 2025 at 09:03

Colab link shared with Jyotinder.

This commit integrates the GPTQ (Generative Pre-trained Transformer Quantization) algorithm into Keras.

Key features include:
- A new `GPTQConfig` for configuring quantization parameters.
- Integration with base Keras models via a `model.quantize()` method.
- Support for multiple datasets (WikiText2, PTB, C4, custom datasets); tested with models including GPT-2, OPT, Bloom, and Gemma 3.
- Includes unit tests to verify perplexity and model functionality post-quantization.
@hertschuh (Collaborator) left a comment

@amitsrivastava78

I didn't see a response to:

One of the expectations is that after quantization, it should be possible to

  • run the model (using the quantized kernels)
  • save it (keeping it quantized)
  • reload it (keeping it quantized)
  • run it again (quantized).

And also

  • export it (which should trace it with the quantized kernels)

I don't see the hooks needed in Dense.quantized_call and EinsumDense.quantized_call and the variables to support that.
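
For illustration, the kind of hook being asked about might look like this in Dense.quantized_call (a hedged sketch only; the attribute names quantized_kernel, kernel_scale, and kernel_zero are assumptions, not variables from this PR):

    from keras import ops

    def quantized_call(self, inputs):
        # Hypothetical GPTQ branch: dequantize the stored low-bit kernel on the
        # fly, then run the usual matmul. Real code would keep the kernel packed
        # and use the quantized variables created at quantization time.
        kernel = ops.multiply(
            self.kernel_scale,
            ops.subtract(ops.cast(self.quantized_kernel, "float32"), self.kernel_zero),
        )
        outputs = ops.matmul(inputs, kernel)
        if self.bias is not None:
            outputs = ops.add(outputs, self.bias)
        return outputs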

else:
    # Test for valid cases where no error should occur
    try:
        model.quantize(mode, config=config)
Collaborator:

Per my PR-level comment, can you add model.save, then reload, and verify it's still quantized?
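
Roughly along these lines (a sketch; sample_inputs and the tolerance are placeholders, and the exact assertions depend on how the quantized variables end up being stored):

    import os
    import tempfile

    import numpy as np
    import keras

    def check_quantize_save_reload(model, config, sample_inputs):
        model.quantize("gptq", config=config)
        reference = model.predict(sample_inputs)

        path = os.path.join(tempfile.mkdtemp(), "quantized_model.keras")
        model.save(path)
        reloaded = keras.saving.load_model(path)

        # The reloaded model should still be quantized and match the
        # pre-save outputs.
        np.testing.assert_allclose(
            reloaded.predict(sample_inputs), reference, atol=1e-5
        )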

from keras.src.quantizers.gptq_core import quantize_model


@keras_export(["keras.GPTQConfig", "keras.quantizers.GPTQConfig"])
Collaborator:

I don't think we should export this as keras.GPTQConfig; very few things should be at the top level. Any reason to do that?

keras.quantizers.GPTQConfig alone works.
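
i.e. something like (a sketch of the suggested change):

    @keras_export("keras.quantizers.GPTQConfig")
    class GPTQConfig:
        ...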

@fchollet (Collaborator) left a comment

Thanks for the PR!

on their activation's second-order information.
"""

W = ops.transpose(ops.cast(self.layer.kernel, "float32"))
Collaborator:

Throughout the code, there's a lot of use of single-letter capital variables, which is against our code style. Variables should be lowercase, with underscores if needed, and they should use reasonably descriptive names rather than single letters.
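
For example (illustrative only):

    # Instead of:
    #     W = ops.transpose(ops.cast(self.layer.kernel, "float32"))
    # prefer a descriptive lowercase name:
    weights = ops.transpose(ops.cast(self.layer.kernel, "float32"))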

analyze the model's activations.
tokenizer: A `keras_nlp.Tokenizer` instance (or a similar callable)
that is used to process the `dataset` if it contains strings.
wbits (int, optional): The number of bits to quantize weights to.
Collaborator:

Argument names don't follow code style. They should:

  • Use underscores, e.g. num_samples
  • Use "num" as the prefix for counts
  • Avoid abbreviations unless the word being abbreviated is extremely obvious.

For instance, percdamp should probably be hessian_damping.


@keras_export(["keras.GPTQConfig", "keras.quantizers.GPTQConfig"])
class GPTQConfig:
"""Configuration class for the GPTQ algorithm.
Collaborator:

This docstring should feature code examples.
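
For instance, something along these lines could go in the class docstring (a sketch; argument names mirror the current PR and may change per the naming comments above):

    class GPTQConfig:
        """Configuration class for the GPTQ algorithm.

        Example:

        >>> config = keras.quantizers.GPTQConfig(
        ...     dataset=calibration_texts,
        ...     tokenizer=tokenizer,
        ...     wbits=4,
        ...     group_size=128,
        ... )
        >>> model.quantize("gptq", config=config)
        """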

class GPTQConfig:
"""Configuration class for the GPTQ algorithm.

This class holds all the parameters needed to apply the GPTQ method
Collaborator:

We should explain what the GPTQ method is and why/when a user should use it.

Labels: awaiting review, Gemma (Gemma model specific issues), size:XL
7 participants